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Abstract 

riiysical systems arc inherently parallel; intuition suggests that simulations of these systems 
may he amenahle to parallel execution. The parallel execution of a discrete-event simulation 
refpiires careful synchronization of processes in order to ensure the execution’s correctness; this 
.syiichronization can degrade performance. Largely negative results were recently reported in a 
study which ii.sed a well-known synchronization method on queueing network simulations. In 
this paper we discu.ss a synchronization method, appointments, which has proven itself to be 
<‘fr<’c.tiv(! on siinulal.ions of I'(’I''S queueing networks. The key concept behind appointments is 
tlm provision of lookahead. Ijookahead is a prediction on a proce.ssor’s future behavior, based 
on a.n analysis of the processor’s simulation .slate. We show how lookahead can be computed 
for FCh’S (|ueucing network simulations, give performance data that demonstrates the method’s 
elfcctiveiiess under moderate to heavy loads, and discuss performance trade-offs between the 
(pi.ality of lookahead, and the cost of computing lookahead. 


*To appear in the Proceedings of the ACM SIGPLAN Symposium on Parallel Programming, Environments, 
Applications, and Languages, Yale University, July 1988. 

*Tliis re.search was supported in part by the National Aeronautics and Space Administration under NASA contract 

NASI-ISKI7 while the author consulted at ICASE, Mail Stop 132C, NASA Langley Research Center, Hampton, VA 
'j.iric.,';. 
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1 Introduction 


IMiysic:il systoms arc inlicrciitly parallel; intuition suggests that simulations of these systems may 
he ameiiahle to parallel execution. The parallel execution of a discrete-event simulation [2] requires 
careful syuchroiiixation of processes in order to ensure the execution’s correctness. A number of 
synchronization methods have been proposed; some have been studied empirically. With few excep- 
tions the evidence is that overhead inherent in these methods prevents any significant performance 
IxMiefit from parallel execution. 

(Queueing network simulations provide a stress test for parallel discrete-event simulation because 
so little computation is associated with each event. Parallel queueing network simulations are also 
interesting from a historical point of view, as much of the early work in this field implicitly uses 
a. <ineueing network model for the simulation. The seminal work in parallel simulation by Chandy 
a nd Misra[l] identified the concept of lookahead as being sufficient to avoid logical deadlock between 
processors. Lookahead is the ability of a process to predict (possibly minutely) those aspects of its 
futur<‘ l)eha.vior whicli affect the synchronization requirements of other processes. Implementations 
of the Cliandy/Misra. algorithms invariably create a lookahead ability by requiring that each job 
r<'ceive a. minimum service time c. Knowledge that a future job requires at least e service allows 
a. processor to i)redict that a job which arrives immediately will not depart for at least e time. 
Mecause most |)rob:ibiIity distributions of interest are not bounded from below, implementations 
must choose c to be very small. Performance studies[6,14] have strongly suggested that this poor 
lookaliead ability leads to dismal performance due to extremely high synchronization overhead. 

In [11] w(> proposed that more extensive lookahead be calculated by analyzing a process’s simu- 
lation state;, and showe'd how this could be accomplished in both queueing network simulations, and 
logic network simulations. In [12] we examined the effect that increased lookahead has on overall 
performance. More recently Fuji mol o re-examined the Chandy/Misra algorithms and focused on 
increasing lof)kahead ability by increasing r. His results are more encouraging, but poor perfor- 
mance is still observed when the ratio of mean service time to c is high (say, 10). Lubachcvsky[9] 
also uses lookahead which is computable under the assumption of minimum service times. While 
he does iK)t rc'port any empirical residts, one can expect his scheme to suffer from similar failings as 
the (Miandy/Misra algorithms as reliance on small minimum service times has already been shown 
1.0 yield poor perforinance. 
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Figure J: Distributed Simulation of Three Queues 


'I'ho j)urpos<^ of this paper is to point out the feasibility of using detailed simulation-specific 
information to compute lookahead in stochastic FCFS queueing network models. Unlike past 
treatments of parallel queueing network simulations this lookahead does not rely upon a minimum 
service time. We discuss the trade-offs between lookahead quality and the cost of computing it, 
ami use a parallel implementation of our method to show that under moderate to heavy loads a 
protocol based on lookaheaxl can yiehl good performance on simulation models that have defeated 
other protocols. Fu jimoto has independently performed a similar study^. 

Fvery processor in a. paralUd discrete-event simulation maintains its own simulation clock, and 
its own ev«Mit list. A simple exami»le clearly illustrates the need for synchronization. Figure 1 
(b'picts the simulation of a thr<’e (pieue network on three processors. Queue Q\ sends a job to 
queim with a time-stamp of 10. 'I’lie first event in <^ 2 ’® f^vent list is the one which accepts 
this job. Simulation correctness is ensured if, within every proc.e.ssor, the simulation time order 
of evaluated events is inonotonically increasing, 'lb ensure this monotoiiicity Q 2 does not process 
the firyt event in its event list until it is certain that some other event with a smaller time-stamp 
will never be ins(irted into the event list. Such an event might occur, for e.xample, if at time 1 an 
i;xternal arrival appears at Q-^, is given 2 units of service, and then is routed to Q 2 . The role of a 
' I’livnl.c roiimuiiiicalion from Richard t'lijiinoto. 
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consfirvativc syncliroiiization protocol is to coordinate Qi, Q 2 , and Q 3 so every processor evaluates 
events moiiolonically in simulation time, so that a processor evaluates the first event in its list 
as soon as it is safe to do so, and so that system deadlock is avoided (or detected /corrected). It 
slioul<l Ix' mentioned that optiTnislic synchronization is currently a topic of active study[7]; under 
an optimistic protocol Q 2 would process the first event in its list with the expectation that no job 
with a sma ll(T time sta mp will appear. If one does appear, then corrective measures must be taken. 
'I'lie protocol di.scussed in this paper is conservative. 

Discrete-event simulation synchronization protocols are typically described in terms of message 
passing behavior between logical processes (LP)'s, or the subsystems modeled by processors. Asso- 
ciated with each L/'is a set of rccu/crs and a set of writers. LP{ is a writer for LPj if it is possible for 
tin' processing of an event in LI\ to cause a “message” to be sent to LPj, who in turn modifies the 
event list in LPj. In tliis case LPj is a reader for LP{. It is useful to distinguish between “content” 
and “|)rotocol” mc.ssages. As the titles suggest, a content message directly concerns the simulation 
and its state while a protocol message concerns only the implementation of the synchronization 
protocol. In the example above a content message from Qi’s processor to Q 2 ’s processor causes 
tli(^ insertion of the event in processor’s event list. At some point Q 3 might send a protocol 
message I 0 Q 2 promising that it will send no Jobs with a time-stamp less than 15 (although we have 
not yet identified how Q.-j can provide such a promise), thereby allowing Q 2 's processor to evaluate 
th(^ arrival at time 10. Protocol messages may themselves be time-stamped. 

In |>revious protocols [13,1,15] a protocol message from LP{ to LPj with a time-stamp of t 
provides a. promise that LPi's next message to LPj (a message which may cause modification of 
LPj's event list) will have a lime-stamp no greater than t. The established protocols vary in their 
details, but all share a distinctive characteristic: the protocol mechanism is largely independent 
of tli<' system being simidated. 'I'his generality is attractive, but requires that an LP’s decision to 
send a |)rotocol me.ssage with a time-stami) of / is based solely on the time-stamps of protocol and 
content messages that the I,P has rc'ceived. To ensure the protocol’s generality information about 
th(^ simidation state, or how an LP responds to a content message is not used. As a consequence 
many protoc(»l messages must be exchanged, as each protocol message allows the simulation to 
precede only incrementally. .Studies of the “Null Message” method have shown that the ratio of 
protocol me.ssages to content messages is very high. Reed’s recent empirical study of this method 
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on <iin'ii('iiiR network sinmlations shows it to be of limited utility[M]. 

'I’he role of a |)rolf)col message from (■o is to provide a lower bound on the simulation 
time at which !,l\ may next affect A/’/s event list. 1’he quality of this bound depends on LPi\ 
ability to predict its future behavior. In the quest for generality, the previous synchronization 
|)rotocols fail to take advantage of knowlerige about the simulated syst(>m. A bettor bound on 
fiitiir<’ behavior can be obtained by analyzing the A/^s simulation state to find lookahead. The 
section lo follow outlines a. synchronization protocol that relies upon the conipntatiou of simulation- 
spj'cific lof>kahead. 

2 The Appointment Protocol 

Ibifore discussing means of idenl ifying lookahead wo will introduce the synchronization protocol that 
uses it. A small tiumber of definitions must first be given. AP,’s simulation clock is denoted C,; 

event list is denoted A',, and is assumed to be ordered by increasing time-stamps, e,- denotes 
till* event at the head of /v’,, and /, denotes its time-stamp. We assume the usual relationship 
between f, arid A', Just prior to |)rocessing the event e,- with time-staini) /,, C, is advanced to /,. 

A serial simnlatiun r<'peatedly executes a three-step cycle: advance the simulation clock to the 
time stamp /, of the first event in the event list c,-, process Cj (which may alter the event list, 
but will iK'ver a<ld ev<>nts with tinie-stamps less than /,), and remove the event just processed. 
I,l\ in a parallel simulation must not process e,- until it is certain that none of AA,’s writers will 
caiis<> an earlier event than c, to be inserted into A,-. The mechanism we use to prevent LPi from 
processing an evcuit ‘M.oo <'arly" is the nppointme.nl. Every one of APi’s writers provides LPi with 
an appruntmeni time l)eyond which A/’, will not advance its clock without further permission. An 
appointment tha t A/’, giv<*s is denoted .djj; we denote the set of all appointments given to LPj 
by {W,}. 0 idy an A/”s writers must supply it with appointment time's. 

rigur<'‘2 gives high level ()seudo-code d«>scribing the appointment protocol. We have left unspec- 
ified otlu-r lu’cessary luechanisms, e.g. asynchronous message-passing routines to update appoint- 
nieid. values and modify the event list. For clarity we have also left nns[)ecifi('d direct optimizations 
whie h ensure* that a m-w appoint me'iit is not re'que'sfeHl before the last such request was satisfieel. 
I, ike all e einseTvative' synchronizatieui [uotercols, this one prevents the proce'ssing of an event if there 
is any chance that an eveid with a smaller time stamp will be insertcel into the event list. 
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Definitions 


C, Value of sinnilatioii dock 

Ei LPi'a event. list. 

f.', First, event on P'i 

fi Time-stainp of first event on Ei 

Af;i A|)poiiitmont provided by writer to reader LPi 

inin{VF,} Minimnin over all appointments given to LP{ by its writers 


Loop { 

lf( /, < min{iy.} ) 

{ = /.•; 

Process event c,-; 

Remove c, from P,; 

} 

KIse 

{ Tor every writer APjt 
If ( Aki < c, ) 

Request a new appointment from LP^; 

For every reader LPj 

If ( [jPj has requested a new appointment ) 

Comimte and send a new appointment A{j\ 

} 

} I''or<‘V<!r 

Figure 2: Appointment Synchronization Pseudo-code 
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TIu' ahilil.y of l.liis protocol to rcdiico .syiirlironi/aiion ovorlioad l,o arcoptahlo levels clearly 
dep<'iids <111 Idle ahilil.y lo provide lookaliead. A (|ii<‘ii<'iiif> network ofl.iMi lias striicl me wliicli allows 
a <pieii<> (^4 to iieriodically provide ii|ip<>r hounds on the times at which it will route jobs to other 
(UKuies. The aggregation of these hounds form the basis of an appointment. The sections to follow 
show how various degrees of lookahead can be computed in queueing network simulations. 

3 Lookahead in FCFS Queueing Networks 

hookahead is easily computed in a stochastic .simulation of a network of FCFS queues. The sim- 
ulation is distrihutc'd by assigning queue,s to processors. Depending on the size of the queueing 
iK'twork, a processor may be assigned several queues. A processor is responsible for simulating 
the* (|iieiieing activity of each of its (pieiies, and for maintaining all statistical information collected 
about the queues’ behavior. An LP then consists of the possibly fragmented subnetwork assigned 
to a proi essor. It is important to note that past treatments of parallel queueing simulations have 
tr<*ated each (|ueue individually as an LP; this invariably leads to high overhead because synchro- 
nization costs are sufb'red on a per-/y/^ basis. 

A typical simulation of a (pieue rcsiuires three event handlers: AddToQueue, BeginService, and 
FinishService. 'I'he random service time of a job entering service is traditionally sampled by 
BoginSorvice, and the destina tion of the completed job is traditionally chosen by FinishService. 
A si'iial simula tion gains nothing by choosing the service time and branching destination any sooner 
than re<iuired. For the |)urposes of computing lookahead there is much to be gained by choosing 
them I'aiiier. Our ability to do so depends in large part on the model assumptions. In the simplest 
hut most common type of stochastic simulation the service time of every job at a queue is drawn 
from a common distribution and tin* branching destination is chosen from a common distribution. 
Not<> that these quantiti('s could be drawn at any time - it can be advantageous to select a job’s 
servici* tiim' and branching destination Ixfoir the job arrives. For example, if at time t queue Qa 
has no jobs enqueued for Qy but it is known that the next job which branches to has service 
time .s, then Qa will send no jobs to Qn before time caH) + s, where e^i (0 is the time at which 
Q A wdl next be empty if no further arrivals occur: / plus the sum of service times of all jobs in 
<iueue at time t. r a(!) + s is a sharj) bound if the next job arrives prior to time e^(/), and has Qb 
chosen as its branching destination. 
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'I'lu' ol)S(!rva(,ioii a bove led us to an organization wliich associates with every queue a future list 
of jobs which have not yet arrived. A job’s service time and l)ranching destination are determined 
wlien it joins the fntnrc' list. The future list is ke|>t large enough so that it contains a Job for every 
possible branching destination. When the event handler AddToQueue is called at simulation time / 
to simulate a. job arrival at Qai the first job in future list is removed and is used to represent 
tlie arrival. If that job branches to Qu, and its removal empties the future list of jobs which 
branch to Qh, then additional jobs arc appended to the future list in a manner which preserves 
the statistical integrity of the simulation — ^.jobs with randomly selected service times and branching 
d('stina.tions are api)ended to the future list until a job with destination Qb is added. Note also 
that onc<^ a. job Jb arrives at Qa its arrival time at the next queue Qb is already determined; 
cnnse<|uently the |)rocessor holding Qb may be immediately informed of Jb's arrival there. This 
is advantageous when Qa a,nd Qb reside in different processors, as it may allow Qb to simulate 
.//j’s arrival ahead (in real time) of its simulated departure from Qa- After computing Jb’s arrival 
time at Qb, we compute a lower bound on the time of Q^i’s next, as yet unseen job to Qb, called 
-iNrri- A description of -Incxi is found in Q/i’s future list. Because Qa is FCFS, we know that 
•Inij-i. cannot depart at least until all jobs current enqueued at Qa receive service, at time e^(t). 
l''urthermore, J^r-xt does not receive service until every job ahead of it in the future list receives 
service. Letting S he the sum of service times of all jobs ahead of and including JiWext in f'lm future 
list, I a{I) + ■‘i is tluMi a lower bound on the time that Qa will next route a job to Qb- This bound 
is cheai)ly computed, and is passed to Qb’s processor along with the message reporting the arrival 
of .//B h’igiirr! .'{ illustrates these |)oints, and a possible transformation of a queue and its future list 
upon the simulated arrival of a job. Figure -1 outlines the roles played by the the event handlers in 
this sclieimr. 

It is apparent from the description above that lookahead information is continually being com- 
|)uted and c'xchanged between <iueues. Ob.serve however that a bound h provided by Qa for Qb 
can become “stale” the bound is predicated on the assumption that the next job from Qa is 
routed as soon as possible; it is possible for the simulation clock in Qb’s processor to advance up 
to I) without another job being sent from Qa to Qb- In the absence of further jobs from Qa, and 
in th(^ absence; of active measures by Qa’s proce.ssor to compute a new bound on the time of the 
lu-xt job from Qa Io Qb, the appointment time a provided by Qa's processor to Qb’s processor 
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Future List 

Job : i.5 
Si-rvicc : 1 
liranch ; 2 


lob : j_4 
Scrvicf. : 10 
Hrandi : 3 


Job: 1,3 
Scrvico : 1 


Jobs in Queue 


Job : j_2 
Service : 7 
Branch : 2 


Job : j_l 
Service : 5 
Branch : 2 


Branch : 3 



simulation lime = I 
residual = I 




Jobs in Queue 


Job : j_l 
Service : 6 
Branch : sink 


Event List 


time : t + 8 

event : add_to_qucuc 

job : j_2 


Bound = t + 20 


lime : t + 7 
cvcnl:rinish_scrvicc 
job:j_l 


simulation time = 1 + 5 
residual = 2 




After arrival of j_2 at time t 


Figure .3: 'IVansformation of qiiouc and future list after a job arrival 
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AddToQueue (.Q , /) 

{ l{<'mov« lirsl. job Jj from front of Q'a futuro list; 

Append .// to otid of Q’s real (puMK'; 

If ( No jobs in future list that branch to ./y. Branch ) 

Repeat { 

(’reate job ./; 

Randomly chose service time J. Service; 

Randomly chose branch J. Branch; 

Ai)i)end ./ to end of Q’s future list; 

} Until (J. Branch = Jy. Branch) 

Jfi = first job in future list such that Jg. Branch = J/. Branch; 
.S' = Sum of future list service times through Jb\ 

A])l = S + f:A{t); 

Notify queue .//.Branch of arrival at time c^(/.); 

Notify queue .//.Branch of appointment at time Apt] 

Add BeginService event to event list at time e^it) — J/. Service; 

} 

BeginService (Q, t) 

{ ./ = First job in (^’s real (lueuc; 

(^unpute desired statistics; 

Ad<l FinisliService event to event list at time t + J. Service; 


FinishService iQ, 0 

{ Remove first job in Q'a real queue; 
C(»mpute desired statistics; 

} 
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fjiniiot, ('xc('(>.(l b. As Qh’s processor advances in simulation time it may find that the first event 
in its list has a time-stamj) larger than a. In this case a new appointment is requested from Q^’s 
lirncessor. It is eveiil ually innimlieiil. upon Q^’s processor to satisfy this request hy computing a 
iii'w appointment. 

hi’i can construct an aiipointinent for hl*j in several different ways. 'I'wo of the simplest ways 
are ili'scrilied below. 

1. ///’, scans all of the lati'st hounds its queues have already provided to queues in LPj. The 
appointment value is the h'ast of these. 

2. scans its event list to lind the first future job arrival to any one of its queues. It compares 
this time to the minimum ajipointment given to it by a writer LP, and denotes the minimum 
of these two values by m; this quantity is a lower bound on the time at which a job next 
arrives at any (|ueue in the IJ\ 'I’hen for every every pair of queues and Qb such that 
(J ,\ lives in LPi and Qb "> we compute a new bound. The new bound is computed on 
the assumption that the next job to arrive at (not necessarily with branching destination 
Q n) arrives at simulation time in. Letting Jncxi be the first job in Q^i’s future list with 
destination Qb and letting S be the sum of service times of jobs ahead and including Jncxi in 
the th<' future list, w<> compute the appointment value max{m,cvi(C',)} + 5. This appointment 
n'llecis the possibility of an arrival precisely at time m — the max term computes the earliest 
time at which the job represented by that arrival begins service in the (|iieue. Among all such 
bounds compiil.ed for all queues, the minimum is the new appointment. 

The first of these methods is the cheapest to compute, but will not not produce a usable appointment 
if any of the old bounds are stale. loirtherniore, LPj can compute this value for itself whenever 
it desires. 'I’lie sc'cond method uses more information (the value in) and so may produce better 
bounds at the cost of some additional computation. It is important to note though that even if some 
bounds are improved, the appointment improves only if the minimum bound is improved upon. It 
is also imiiortant to note that for any given queue, recomputing a bound with an increased value 
of m will not improvi’ the bound if m is less than the next known time that the queue could be 
em|)ty. The key idea, behind using future list to compute an appointment is to find a lower 
bound on the time at which the next job arrives at Q^. The second scheme is quite pessimistic 
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wlirn <()mi>Ml.inK Uiis homul. 11. is possible dial, valine th'liniiif' in is associated vvil.Ii a. (| 1 kmu* far 
removed from Q,\, aiiil dial, a imicli Ix'der bound on l.lie next, arrival al is jiossiblc. We liave 
implemented a. method wliicli analyzes the full simulation state in an LP in order to determine for 
('acli Qa the best possible bound on the time of its next arrival. Then for every writer/reader 
pair Q,\ Qh bound is constructed just like the one above, with taking the place of m in 
the calculation. 'I’lie minimum bound so calculated is the new appointment. A description and 
analysis of this method follows. 

We first freeze all incoming bounds to LPi's queues by making copies of their current values; 
Ihi.s eliminates any further effect that other proce.s.sors Ctan have on the forthcoming algorithm. 
I0v(Ty (lueue which reads from an off-processor queue has its min^apt value set to the minimum of 
its frozen off-procc'ssor bounds. Next, we scan the event list for job arrival events. Associated with 
each such event is a target (pieue; the arrival time is used to update the queue’s min.apt value if 
that value either exceeds the job arrival time, or is in the initial state. Following these initialization 
steps, every queue’s mhuapl value is either null, or is equal to the minimum time at which a job 
might arrive cither from off-jirocessor, or from the event list. The problem now is to analyze the 
elfects of job arrivals at those minimum times. This analysis is performed by essentially simulating 
till! ('fleets of job arrivals. For every queue with some value in its miujapl field we place in a shadow 
rvent list a shadow rvent which denot(5s a job arrival at time niiii-apt. The shadow-time-stamps of 
shadow-events taken off of the shadow-event list will be monotonically increasing. Proceeding with 
the shadow-simulation, we remove the minimum time shadow-event from the shadow-event list. If 
the sjiecified (|U(!ue has already been “touched” by the shadow-simulation we simply discard the 
shadow-event. Otherwise we consider the effects of a job arrival at the specified queue (say Qa)i 
at the sliadow-(!venl time. 'I’liis is accomplished by computing a bound for each of Qa's readers, 
based on the assumplion that a. job arriv(« at the shadow-arrival time. Shadow-events describing 
these arrivals are inserted into the shadow-event list, the queue is marked as having been touched 
by the shadow-simulation, and a count of “touched” queues is incremented. We are finished if this 
count ecpials the number of unfixed <pieues. Because the shadow-simulation simulates propagation 
of jobs through the network at the earlicvst po.ssible times, the shadow-time associated with the 
first touch of a queue by the shadow-simulation is a lower bound on the time of the next true job 
arrival at the (pieue. Once the shadow-simulation h;is finished it is a simple matter to compute 
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iK'w hounds for <|U(Mi(‘s in oUif'r procc'ssois hy iisiiif^ llio sli;uIow-jol) nrrival liinos. 

riic coinplrxily ol l.liis nu'l hod is /‘/'lof; h'), wlion* /'/’ is l.lio tiimilxT of iiil.or-(|ii('uo connections 
in the l,P. 'I'liis follows hecause any Kive.'i intcr-iinene link will have a simulated shadow-arrival 
schedided to < rf)ss it at most once (hecanse a qnene is touched at most once), and priority lists 
such as heaps exact a logarithmic, cost for each access. This complexity does not consider the cost 
of initializing the priority heap. Initialization requires that we determine each queue’s minimum 
incoming off-processor hound. Letting E denote the number of links from off-processor queues, this 
is achieved in 0(E) time. We must also determine for each queue whether there is a future job 
arriwU in the event list. It is possible to link events in a such a way that the first arrival event for 
any given queue is accessible in constant time. This endows the initialization phase with an 0(n) 
comiilexity, where n is the nnmher of queues on the LP. The O(ElogE) cost thus dominates. It 
is appropriate to point out that this method is similar in spirit to that discussed in [5]. Due to 
dilferences in the models and applications, (Iroselj and Tropper’s algorithm has a slightly smaller 
comiilexity ()(ji\ogn+E). 

Yet another approach to computing lookahead is quite general, and does not employ the inter- 
queue hounds at all; instead, it analyzes each processor’s event list. Imagine momentarily that all 
processors are temporarily inhibited from modifying their event lists. Let /,rjn be the minimum time 
stamp among all job arrival events on the event lists. Then clearly any appointment value a < tmin 
between any two procc'ssors can he increased to /min- This type of lookahead is equivalent to that 
proposed by Luhachevsky [9]; however, the “minimal propagation” delays his method depends on 
arc usually zero in general stochastic (pieneing networks. Lubachevsky’s method calls for global 
synch ronizations between processors so that can be found, and events which can be performed 
concurrently be identified. Our overall approach is asynchronous, and we prefer to avoid global 
synchronizations if possible. A lower bound on f,,,;,, can be constructed asynchronously under the 
assumption thal. messages between U\ and LPj are received in the order that they are sent. Let 
'l'onr\ ami 'l'nu(2 be two arrays such that 7'onc I, contains a sna])shot of of Ll\'s minimum job 
arrival event at some real time .S|,, Toac'li contains a snapshot of of L/Vs minimum job arrival 
evetit at .some rcuil time .s.j,-, and .s’l, < s-ij for any i and j. It can be shown that the minimum value 
in tin; 'I'onrA array is a lower bound on anji future job arrival event time, and is consequently a lower 
bound on any interj)rocessor appointment, 'fhe Tone arrays arc easily maintained by appending 
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minimal fuliirn job arrival tiim's onto m<'ssap,os cxcliaiiged betwrmii procossors. 'I'liis method is 
even easier to iinphnnent on a sliared-memory machine if the event lists are stored in common 
memory, one proc(\ssor can be solely dedicated to the task of collecting 'I'one values and updating 
stale appointments. 

VVe have implemented the second, third and fourth of these methods. The following section 
discusses their observed |)orformance. 

4 Performance Results 

We have implemented a parallel discrete-event queueing network simulation on NASA Langley’s 
I'’h'x/.T2 [10] miiltij)rocessor. 'I'he Flex/.T2 is a. bus-oriented shared-memory architecture which sup- 
ports both local and global memory. Our implementation takes advantage of the global memory- 
each processor’s (went list is in global memory, and one processor may insert an event into another’s 
list. Mutual exclusion is enforced using low-level primitives such as spin-locks. Data structures de- 
scribing the bounds between (iiienes and the appointments between processors are also organized 
in the global memory. 

The synchronization method employed to ensure simulation correctness is only one of a host 
of performance issiu's that must be addressed by a parallel simulator. In order to study the 
ellV’ctivem'ss of the synchronization method largely in isolation from other factors (such as load 
balancing), we haved studied simple, very homogeneous queueing networks which arise in the design 
of inter-proc(\ssor communication networks: rings, meshes, hypercubes, and multistage routing 
networks. We assunu' that every server in a network has the same service time distribution, and the 
same homogeneous branching probabilities. 'The studies we d(\scribe here concern closed networks of 
‘2r)(i nodes (except for .'ISd nodes in the multistage case) simulated using sixteen processors. Queue 
i is assigned to processor i mod n, where n is the number of processors. 

SjmkIu]) is the time reriuiix'd to solve the i)roblcm on a serial implementation divided by the 
time re(|uired by a parallel implementation. It is easy to use the parallel code on one processor 
as the serial version the nUjorilhmic speedup so calculated measures the method’s efficiency as a 
function of the nnndx'r of proce.ssors ii.scd. It does not however measure the end-user’s benefit from 
parallelism. 'Phis benefit can only be measured by comparing the performance of an optimized 
serial version with the parallel version. Our performance measurements are based on this latter 
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iiuNislinMilcnI. of spcH'd ii|); < li<’ o|)l iniizod .s<'rial version was croatod from l lio parallel version by re- 
moving all code related 1,o miil iial exclusion and synclironizalion, and by removing all computations 
r*>laled to tbe future tiucue. A comparison bctw'een tlic optimized serial version and the parallel 
version on one processor tells us something about the cost of a processor’s Internal overhead of 
doing |)arallel processing (e.g., calls to synchronization routines); it also gives us an upper bound 
on the speed ups we can expert. Kach of our performance graphs is marked with this upper bound 
1,0 better redert how »'fnci«uit the |)iogram is relative to its inescapable internal overhead. 

'The data structures and algorithms used t(j iiianage the event-list have a critical effect on per- 
formance. In the interests of rapid-prototyping we first implemented the event list as a naive, 
donbly-linkc'd list. Under moderate loads we achieved a speedup of 24 using 8 processors! This 
anomaly is simply explained by realizing that the serial version is sub-optimal (see [8] for a per- 
formance study of various list-management algorithms); anomalies cjf this type have been observed 
in other contexts [ l]. VVe snbseciuently implemented a simple, but more efficient list management 
algorithm by associating an ordered <iuenc of events with each individual queue, and then use a 
combining tree to identify the (went list with smallest minimal event. 

'I'he statistics collected by our iM'ogram are minimal: for each queue we maintain a 128 element 
histogram of job waiting times. Updating the histogram requires only a binary search to select a 
bin, and an increment. 

The ring topology allows a. (|iieue to send jobs to either a left or right neighbor; the mesh 
t(Ji)oIogy connects North, South, lOast, and West neighbors, and wraps around the edges to create 
a torus. The hypercube t(q)ology is the usual one; the multistage network consists of six stages, 
each of which has sixty-four <|ii('ues, and which feed forward to the next stage using the Hutterfly 
interconnection pattern. 'I’he last stage fe(<ds the first stage. 

All of our ex|)erimeiits employ sixteen |)roce.ssors. In one set of experiments we assume that the 
service time is exponential with mean /t = 1 .0; another set of experiments treats the service time as 
the constaid. 1.0. Mc'cause these networks areclo.s('d, the simulation load is varied by adjusting the 
number of jobs placed into the system. Hecanse of homogeneity the load can be described simply 
by //, the average number of jobs in queue at a server. For every topology and service distribution 
we varied // within the. set (1, 2, 4, 0,8, 10}. For each set of parameters we simulated the network 
ten tiinr's, starting from an initial conliguration where each queue has exactly u jobs in queue. 
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'I'Ik' simulal.ioii w;».s Icrmiiuilcd nfl.cr all i)i()C('.ssors had a.d vaiicc'd to simiilalion l.iino 100. Larger 
U'riiiina(,ioii limes would l>e dosirahh' if we were interested in accurate queueing network statistics; 
however, the timings on experiments with larger termination times scaled directly, required much 
more CPU time, and were subseciuently dropped. The execution time measurements exclude the 
1/0 tiim^ re([uired to initially load the prohlem, but include all other I/O required during the course 
of a run. Our performance curves plot intervals to represent speedup. The intention is to both 
show what sort of sp<‘eduj)s can be expected, and what variation there is in the speedup estimates. 
It is unrea-sonable to measure true speedup by inducing precisely the same branching and service 
time behavior in the serial and parallel versions. Instead, for each set of experimental parameters 
w<! im-asured the mean /t,, and standard deviation <Tp of ten parallel runs, and the mean fig and 
sample standard (bwiation o'., of ten serial runs. 'I'lien we plot an interval containing a. high speedup 
estimate, (/t., -P o-,)/(/jp - iTp), and a low speedup estimate, (fig — ag)/{fip + Cp). 

Figures .'> and (i presents the si)eedup intervals. Each graph’s title ha,s the form “Topol- 
ogy/Lookahead type/Distribution”; the lookahead type is Full or Border, depending on whether the 
lookahead calculation analyzed the full LB state or simply computed bounds at the queues which 
feed olf-processor rpiem's. A number of observations stand out. Ordered roughly by importance, 
they are: 

1 . Under moderat** to heavy simulation loads every graph approaches its optimal level (a speedup 
which temis to be clo.se to eleven). These experiments show that good speed ups arc sometimes 
possible in the.se types of simulations. If the simulation load is low the proportion of useful 
work to lookalx'ad computation has to diminish, yielding poor speedups. 

2. 'The service time variation has a strong elfect on speedup. Under high variation very small 
lookahead values are i)ossible, meaning that lookahead is computed more often, thereby in- 
curring incr<'ase<| overhead. 'I'his is in agreement with FujimOto’s experiments[3]. 

3. Network topology strongly affects performance under low loads. Ilypercubes have a richer in- 
lercounc'ction structure, which causes increased uncertainty in future behavior (meaning that 
lookahead bounds are not sharp). Under low loads and exponential service times simulation 
f)f hypercube interconnections performed poorly while other interconnections did somewhat 
better. 
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-I. Simulations of rings tend to liavc higher variance. This is understood by realizing that high 
workload in some network region does not easily disperse; the oth(!r topologies are better at 
s])reading jobs around the network. This understanding of the phenomenon is re-enforced by 
Reed’s observation[14] that concentrated chains of jobs tended to form in his simulations. 

5. The form of lookahead used (Border or Full) has a smaller effect on performance than we 
anticipated. In this set of experiments the cheaper form of lookahead (Border) uniformly 
performed better, but this effect was secondary when compared to the effects of service time 
distribution and topology. We hasten to recall though that the mapping of queues to pro- 
cessors forces every queue to feed a proportionally large number of off-processor queues, so 
that the lookahead gained by collecting additional information from on-processor queues is 
overshadowed by the cost of collecting that information. We did study two variations on 
the lookahead calculation which only analyzes event lists. In one variation we dedicated a 
processor to the task of searching for this type of lookahead while all other processors did 
simulation work. This scheme had very little impact on the execution times. In a second vari- 
ation we relied entirely on appointments computed by the auxiliary processor, and achieved 
comparatively poor speed ups, even under high loads. 

Two other points are of interest and are not shown in these graphs. Network size has some 
effect on performance; as expected, larger problems yield larger speed ups, although the speedup 
still depends most heavily on the average queue length and the service time distribution. Secondly, 
we measured the number of times the lookahead analysis algorithm is called in the course of a 
simidation run. Under high loads {u = 16) the analysis routine is never called: the ordinary 
lookahead computed with every arrival to a queue sustains the progress of the simulation. 

We reiterate the main conclusion that we can draw from this data: at least under limited 
circumstances it is possible to achieve good real speedups by using a conservative synchronization 
mechanism which exploits the problem being simulated. 

5 Summary 

'I’lie parallelization of discrete-event simulations has proven to be a difficult problem, due in large 
part to extensive and irregular synchronization requirements. One means of alleviating that syn- 
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cliroiiizalioii hiirdcii is to liavc processors analyze llieir siimilation state and compute lookahead 
lower lionnds on times at wliicli tliey perrorin actions that directly alh'Ct the evrmt lists of other 
processors. We illustrate this techni<iiie on the knotty prohle.m of stochastic (lucucing network 
simulations. 'I'ln'se simnlations are particularly difricult because their intrinsic computation to syn- 
chronization cost ratio is so dis-advantageous. We show how the sinuilation can be re-organized 
to alhjw lookahead to be computed for FCFS queueing networks, discuss trade-ofTs between the 
(piality of loofcihead and the cost of providing it, and demonstrate the effectiveness of the method 
by implementation on several common queueing network topologies. This result stands in contrast 
with |)revioiis studies which used synchronization mechanisms that are largely unaware of the un- 
derlying simulation problem. Generality in a synchronization mechanism is a worthy goal, but the 
price of that goal may be poor i)erformance. 
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